Distributional analysis is a term I coined for a very simple yet powerful way of analyzing datasets. It means that you think of the dataset as a distribution within a large multidimensional space, which you can then examine through its marginal statistics in any two-dimensional subspace.
This is my favorite tool for exploratory data analysis—meaning you don't know what you're looking for just yet, you're simply getting to know the data. The best way to understand this is through examples. So let's turn our attention to one of my favorite datasets, the global surface drifter dataset.
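Before turning to the real data, the basic move, viewing a dataset through its bin counts in a chosen two-dimensional subspace, can be sketched on synthetic data. Everything below (the variables, the bin choices) is invented purely for illustration, using the same scipy function the lab relies on throughout.

```python
#a minimal sketch on synthetic data: view a dataset through its bin counts
#in a chosen two-dimensional subspace; all values here are made up
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 10000)                   #stand-in for, say, latitude
y = 0.5 * x + rng.uniform(-0.5, 0.5, 10000)     #a correlated stand-in variable

xbins = np.linspace(-1, 1, 33)
ybins = np.linspace(-1, 1, 33)
counts = stats.binned_statistic_2d(x, y, None, bins=[xbins, ybins],
                                   statistic='count').statistic

print(counts.shape)       #(32, 32): one count per two-dimensional bin
print(int(counts.sum()))  #10000: every point lands in exactly one bin
```

Summing `counts` along either axis recovers the one-dimensional marginal distribution, which is exactly the kind of "rotation" of perspective used later in the lab.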
Surface instruments record latitude and longitude—and therefore horizontal velocity—as well as, sometimes, temperature as they drift following the ocean currents. We're using this classic dataset:
Lumpkin, Rick; Centurioni, Luca (2019). Global Drifter Program quality-controlled 6-hour interpolated data from ocean surface drifting buoys. NOAA National Centers for Environmental Information. Dataset. https://doi.org/10.25921/7ntx-z961.
If you want to run this lab yourself, as opposed to simply viewing it, you'll need to grab the single NetCDF file version of this dataset. That file is currently called gdp_jul22_ragged_6h.nc and can be downloaded from here.
#install needed packages using
#conda install -c conda-forge numpy matplotlib xarray dask netCDF4 bottleneck scipy pandas cartopy ipython ipympl nodejs cmasher jupyterlab git pip seaborn
#or its equivalent
import time
import numpy as np
import matplotlib.pyplot as plt
import xarray as xr
from scipy import stats
import cartopy.crs as ccrs
import cartopy.feature as cfeature
import warnings #ignore warnings for readability
warnings.filterwarnings('ignore')
#load in hourly drifter dataset
#timing this script
tic = time.time()
datadir = '/Users/lilly/Desktop/Dropbox/NetCDf/' #path to the directory where you've put gdp_jul22_ragged_6h.nc
drifters = xr.open_dataset(datadir + 'gdp_jul22_ragged_6h.nc')
drifters
<xarray.Dataset>
Dimensions: (traj: 26843, obs: 44544647)
Coordinates:
ids (obs) int64 ...
time (obs) datetime64[ns] ...
lon (obs) float32 ...
lat (obs) float32 ...
Dimensions without coordinates: traj, obs
Data variables: (12/44)
ID (traj) int64 ...
rowsize (traj) int32 ...
WMO (traj) int32 ...
expno (traj) int32 ...
deploy_date (traj) datetime64[ns] ...
deploy_lat (traj) float32 ...
... ...
vn (obs) float32 ...
temp (obs) float32 ...
err_lat (obs) float32 ...
err_lon (obs) float32 ...
err_temp (obs) float32 ...
drogue_status (obs) bool ...
Attributes: (12/16)
title: Global Drifter Program six-hourly drifting buoy collec...
history: Last update July 2022. Metadata from dirall.dat and d...
Conventions: CF-1.6
date_created: 2022-12-08T18:44:27.784441
publisher_name: GDP Drifter DAC
publisher_email: aoml.dftr@noaa.gov
... ...
contributor_name: NOAA Global Drifter Program
contributor_role: Data Acquisition Center
institution: NOAA Atlantic Oceanographic and Meteorological Laboratory
acknowledgement: Lumpkin, Rick; Centurioni, Luca (2019). NOAA Global Dr...
summary: Global Drifter Program six-hourly data
doi: 10.25921/7ntx-z961
#set up some needed variables and default plotting properties
lon = drifters.lon.values
lat = drifters.lat.values
u = drifters.ve.values*100 #convert to cm/s
v = drifters.vn.values*100 #convert to cm/s
sst = drifters.temp.values
tim = drifters.time.values
#remove data points containing NaNs (the mask is named valid to avoid shadowing the builtin bool)
valid = ~np.isnan(u) & ~np.isnan(v) & ~np.isnan(lat) & ~np.isnan(lon)
lat = lat[valid]
lon = lon[valid]
u = u[valid]
v = v[valid]
sst = sst[valid]
tim = tim[valid]
figsize=np.array([18, 12]);
projection=ccrs.PlateCarree();
#some other projections we might try
#projection=ccrs.Robinson();
#projection=ccrs.LambertCylindrical();
#choose a default colormap
cmap = plt.cm.get_cmap('Spectral_r', 64)
#some other colormaps. Note that the '_r' reverses the map
#cmap = plt.cm.get_cmap('plasma_r', 40)
#cmap = plt.cm.get_cmap('magma_r', 40)
#cmap = plt.cm.get_cmap('inferno_r', 40)
#cmap = plt.cm.get_cmap('viridis_r', 40)
#cmap = plt.cm.get_cmap('RdYlBu_r', 64)
#for future reference, define a function to set up our map
def map_setup(figsize, projection):
    fig = plt.figure(figsize=figsize)
    ax = fig.add_subplot(projection=projection)
    ax.add_feature(cfeature.LAND, facecolor='grey')
    ax.gridlines(draw_labels=True)
    ax.set_ylim(-80, 80)
    return fig, ax
Let's make a basic plot of the drifter dataset. First we will loop over all of the trajectories in order to plot each trajectory with a different color.
fig, ax = map_setup(figsize,projection)
#find the head (a) and tail (b) of each segment
b = np.cumsum(drifters.rowsize)-1
a = b-drifters.rowsize+1
#in what follows, we need to work with the original data drifters.lon and drifters.lat, not the lon and
#lat created above with NaNs removed, because drifters.rowsize applies to the original data
#set lon to nan when we jump across the dateline, for plotting purposes
lon_nojumps = drifters.lon.values
lon_nojumps[np.abs(lon_nojumps-np.roll(lon_nojumps, -1))>10] = np.nan; #nan for lon jumps > 10 degrees
#plot each float trajectory with its own color ... this takes a minute
for n in range(0, np.size(drifters.ID.values)):
    #index into values belonging to the nth trajectory, including its last point at index b
    index = np.arange(a.item(n), b.item(n) + 1)
    #note you can't use a[n],b[n] because those are length 1 arrays, not scalars
    if index.size > 0:
        plt.plot(lon_nojumps[index], drifters.lat.values[index], transform=ccrs.PlateCarree(), linewidth=0.5)
#The 'transform = ccrs.PlateCarree()' specifies that the data are given as latitude and longitude
#By default, the data is assumed to be in the same projection as that of the axis object
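The head/tail bookkeeping for the ragged array can be sanity-checked on a toy rowsize vector; the numbers below are made up, but the cumsum logic is the same as in the cell above.

```python
#sanity check of the cumsum head/tail indexing on a toy ragged array
import numpy as np

rowsize = np.array([3, 2, 4])   #three trajectories of 3, 2, and 4 observations
b = np.cumsum(rowsize) - 1      #index of the last observation of each trajectory
a = b - rowsize + 1             #index of the first observation of each trajectory

print(a.tolist())  # [0, 3, 5]
print(b.tolist())  # [2, 4, 8]
#the segments a[n]..b[n] tile the obs dimension 0..8 exactly once
```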
However, this doesn't tell us much quantitatively. To go further, we're going to start exploring this dataset as a distribution. The first step is to plot the two-dimensional histogram of observation locations in latitude–longitude space.
#In the code below, stats.binned_statistic_2d() is the central function, which we can use to compute
#a number of different statistics. First we are interested in the number of observations in each bin
#specified by statistic='count'. The third argument, where observation values would normally go, can
#be 'None' in such a case.
fig, ax = map_setup(figsize,projection)
dlatlon = 0.25
lonbins = np.arange(-180, 180, dlatlon)
latbins = np.arange(-80, 80, dlatlon)
hist = stats.binned_statistic_2d(lat, lon, None, bins=[latbins, lonbins], statistic='count') # Returns a BinnedStatistic2dResult object
hist.statistic[hist.statistic == 0] = np.nan # Sets all values of 0 to nan as log10(0) = -inf
#In using stats.binned_statistic_2d, the input argument needs to be a 1-D array. If you're using higher-
#dimensional data, such as model data, then you would want to flatten the input arrays with np.ravel().
#Similarly the input arguments should contain no NaNs. If they do, you would want to remove them with
#valid = ~np.isnan(x) & ~np.isnan(y)
#hist = stats.binned_statistic_2d(x[valid], y[valid], None, bins=[xbins,ybins], statistic='mean').statistic
image = plt.pcolormesh(lonbins, latbins, np.log10(hist.statistic), cmap=cmap, shading='flat', \
transform=ccrs.PlateCarree())
plt.title('Distribution of Six-Hourly Data Points')
plt.clim(1,2.75)
fig.colorbar(image, ax=ax, orientation='vertical', shrink=0.5, aspect=30, pad=0.05, \
label='Log10 Number of Observations');
We see that the observation density is high in the central gyres, particularly in the North Atlantic, and is low in the Arctic, Antarctic, upwelling zones, and shallow seas.
Let's take a look now at the distribution in time. For this, it will be useful to use latitude as one axis and time as the other.
#distribution of data points over latitude and time
#compute the year and the yearday from the datetime64 time array
yr=tim.astype('datetime64[Y]').astype(int)+1970
dy=tim.astype('datetime64[D]').astype(int) - tim.astype('datetime64[Y]').astype('datetime64[D]').astype(int)+1
#dy is yearday, formed as (day) - (day for the Jan 1 of the current year)
fig, ax = plt.subplots(1, 2,figsize=np.array([18, 10]))
plt.sca(ax[0])
xbins = np.arange(1978, 2023)
hist = stats.binned_statistic_2d(lat, yr, None, bins=[latbins, xbins], statistic='count') # Returns a BinnedStatistic2dResult object
hist.statistic[hist.statistic == 0] = np.nan # Sets all values of 0 to nan as log10(0) = -inf
image = plt.pcolormesh(xbins, latbins, np.log10(hist.statistic), cmap=cmap, shading='flat')
plt.title('Data Distribution Over Year and Latitude')
ax[0].set_xlabel('Year')
ax[0].set_ylabel('Latitude')
plt.sca(ax[1])
xbins = np.arange(0, 366, 5)
hist = stats.binned_statistic_2d(lat, dy, None, bins=[latbins, xbins], statistic='count') # Returns a BinnedStatistic2dResult object
hist.statistic[hist.statistic == 0] = np.nan # Sets all values of 0 to nan as log10(0) = -inf
image = plt.pcolormesh(xbins, latbins, np.log10(hist.statistic), cmap=cmap, shading='flat')
plt.title('Data Distribution Over Yearday and Latitude')
ax[1].set_xlabel('Day of Year')
plt.yticks([])
plt.subplots_adjust(wspace=0.025)
fig.colorbar(image, ax=ax[0], orientation='horizontal', shrink=0.8, aspect=30, pad=0.08, label='Log10 Number of Data Points');
fig.colorbar(image, ax=ax[1], orientation='horizontal', shrink=0.8, aspect=30, pad=0.08, label='Log10 Number of Data Points');
We see that there is much interannual variability in the data distribution, but little annual variability. This is good news: because the sampling is nearly uniform over the annual cycle, the annual cycle (which we expect to be stronger than the interannual variability in this case) won't alias into the mean statistics.
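The datetime64 arithmetic used above for year and yearday is terse, so here is a check on dates with known answers; the two dates below are arbitrary.

```python
#verify the year and yearday computation on dates we know
import numpy as np

tim = np.array(['2020-03-01', '2021-01-01'], dtype='datetime64[ns]')
yr = tim.astype('datetime64[Y]').astype(int) + 1970
dy = tim.astype('datetime64[D]').astype(int) \
     - tim.astype('datetime64[Y]').astype('datetime64[D]').astype(int) + 1

#2020-03-01 falls in leap year 2020 on yearday 61 (31 + 29 + 1); 2021-01-01 is yearday 1
print(yr.tolist())  # [2020, 2021]
print(dy.tolist())  # [61, 1]
```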
It's very useful to also examine basic statistics as a function of two parameters. The most obvious choices for the axes of such two-dimensional statistics in this case would again be latitude and longitude.
#in working with binned_statistic_2d, we need to make sure we have removed NaNs, which
#we did just after we loaded in the data
fig, ax = map_setup(figsize,projection)
ubar = stats.binned_statistic_2d(lat, lon, u, bins=[latbins, lonbins], statistic='mean')
vbar = stats.binned_statistic_2d(lat, lon, v, bins=[latbins, lonbins], statistic='mean')
meanflow = np.sqrt(ubar.statistic ** 2 + vbar.statistic ** 2)
image = plt.pcolormesh(lonbins, latbins, meanflow, cmap=cmap, shading='flat',transform=ccrs.PlateCarree())
plt.title('Speed of Mean Surface Currents')
plt.clim(0, 80)
fig.colorbar(image, ax=ax, orientation='vertical', shrink=0.5, aspect=30, pad=0.05, label='Speed (cm/s)');
We see the major current systems as thin ribbons: the Gulf Stream, the Kuroshio, the Brazil Current, the Agulhas Current, the equatorial current system, and so forth.
A slightly different quantity is the mean of the current speed, rather than the speed or magnitude of the mean flow.
The mean speed paints the currents as slightly broader ribbons, because its magnitude is not reduced by variability in a current's direction.
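The distinction shows up clearly in a small synthetic check (the velocity values below are invented): a current that reverses direction has a near-zero mean vector but a large mean speed. In the lab's own terms, the mean speed is what statistic='mean' applied to np.abs(u+1j*v) computes.

```python
import numpy as np

#an oscillating current: eastward half the time, westward the other half (cm/s)
u = np.array([30.0, -30.0, 30.0, -30.0])
v = np.zeros(4)

speed_of_mean = np.abs(np.mean(u) + 1j * np.mean(v))  #magnitude of the mean vector
mean_speed = np.mean(np.abs(u + 1j * v))              #mean of the magnitudes

print(speed_of_mean)  # 0.0
print(mean_speed)     # 30.0
#the mean speed always equals or exceeds the speed of the mean flow;
#they agree only when the current direction is steady
```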
A measure of the variability is the velocity standard deviation, which we look at next.
ustd = stats.binned_statistic_2d(lat, lon, u, bins=[latbins, lonbins], statistic='std').statistic
vstd = stats.binned_statistic_2d(lat, lon, v, bins=[latbins, lonbins], statistic='std').statistic
std = np.sqrt(ustd ** 2 + vstd ** 2)
std[std == 0] = np.nan #set zero values of standard deviation to nans
fig, ax = map_setup(figsize,projection)
image = plt.pcolormesh(lonbins, latbins, std, cmap=cmap, shading='flat', transform=ccrs.PlateCarree())
plt.title('Standard Deviation of Surface Currents')
plt.clim(0, 80)
fig.colorbar(image, ax=ax, orientation='vertical', shrink=0.5, aspect=30, pad=0.05, label='Standard Deviation (cm/s)');
To see how these compare we can examine the coefficient of variation, defined as the ratio of the standard deviation to the mean. This is a nondimensional quantity showing the relative magnitude of variability. The larger the coefficient of variation, the larger typical deviations from the mean are with respect to the mean itself.
Note that the coefficient of variation is only meaningful when there is a true zero, that is for quantities like speed, height, amount of precipitation or radiation, etc. Such data is called ratio data in statistics. One would not use it for, say, temperature, because while 0 K is well defined, it is not a meaningful reference zero for typical planetary processes.
For plotting the coefficient of variation, it is useful to take the log, as this has the nice property that log(1/a)=-log(a).
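A quick numerical check of that symmetry, using a made-up bin where the standard deviation is half the mean:

```python
import numpy as np

#hypothetical bin: standard deviation of 20 cm/s against a 40 cm/s mean
cv = 20.0 / 40.0
print(np.log2(cv))        # -1.0
print(np.log2(1.0 / cv))  # 1.0
#log2(1/a) = -log2(a): ratios of 1:2 and 2:1 sit symmetrically about zero,
#which is why a diverging colormap centered on zero treats them evenly
```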
fig, ax = map_setup(figsize,projection)
image = plt.pcolormesh(lonbins, latbins, np.log2(np.divide(std,meanflow)), \
cmap=plt.cm.get_cmap('BrBG_r', 64), shading='flat', transform=ccrs.PlateCarree())
plt.title('Coefficient of Variation (Ratio of Standard Deviation to Mean)')
plt.clim(-3,3)
fig.colorbar(image, ax=ax, orientation='vertical', shrink=0.5, aspect=30, pad=0.05, label='Log2 Coefficient of Variation (nondimensional)');
Let's change now to another quantity, the sea surface temperature.
However, note that the number of data points for which temperature is actually measured is considerably smaller than the total. Thus, if we were going to look further at this, we would also want to redo the data distributions presented earlier, counting only the data points for which sea surface temperature is not NaN.
For expediency, we will skip this step and go straight to the two-dimensional statistics.
valid = ~np.isnan(sst)
fig, ax = map_setup(figsize,projection)
sstbar = stats.binned_statistic_2d(lat[valid], lon[valid], sst[valid], bins=[latbins, lonbins], statistic='mean').statistic
image = plt.pcolormesh(lonbins, latbins, sstbar, cmap=cmap, shading='flat', transform=ccrs.PlateCarree())
plt.title('Mean Sea Surface Temperature')
plt.clim(0, 30)
fig.colorbar(image, ax=ax, orientation='vertical', shrink=0.5, aspect=30, pad=0.05, label='Temperature (deg C)');
Note the multi-scale structure present here: global, regional, and small-scale or mesoscale.
The large-scale temperature gradient is dominating the picture, obscuring the other variability. To account for this, we can subtract the longitudinally averaged temperature at each latitude. This forms the temperature deviation from the zonal average.
#temperature deviation from zonal average
fig, ax = map_setup(figsize,projection)
#If anyone knows an easier way to tile an array into a matrix, let me know!
sstbarlat=np.repeat(np.atleast_2d(np.nanmean(sstbar, axis=1)).T,np.size(sstbar,axis=1),axis=1)
image = plt.pcolormesh(lonbins, latbins, sstbar-sstbarlat, cmap=cmap, shading='flat', transform=ccrs.PlateCarree())
plt.title('Temperature Deviation from Zonal Average')
plt.clim(-7, 7)
fig.colorbar(image, ax=ax, orientation='vertical', shrink=0.5, aspect=30, pad=0.05, label='Temperature (deg C)');
We see that subtracting the zonal mean reveals smaller-scale structures. We now see more clearly that, generally speaking, it's colder on the eastern sides than on the western sides of ocean basins.
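In answer to the comment in the code above: NumPy broadcasting with keepdims=True does the same tiling implicitly. A check on a made-up 2-by-3 grid (with one empty bin, to confirm NaNs pass through):

```python
import numpy as np

#a made-up 2-by-3 grid of binned mean temperatures, with one empty (NaN) bin
sstbar = np.array([[20.0, 22.0, np.nan],
                   [10.0, 12.0, 14.0]])

#explicit tiling, as in the cell above
tiled = np.repeat(np.atleast_2d(np.nanmean(sstbar, axis=1)).T,
                  np.size(sstbar, axis=1), axis=1)
#broadcasting: keepdims leaves a (2, 1) array that subtracts from the (2, 3) grid
anomaly = sstbar - np.nanmean(sstbar, axis=1, keepdims=True)

print(np.allclose(sstbar - tiled, anomaly, equal_nan=True))  # True
```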
Finally we look at the sea surface temperature standard deviation.
#sea surface temperature standard deviation
valid = ~np.isnan(sst)
fig, ax = map_setup(figsize,projection)
sststd = stats.binned_statistic_2d(lat[valid], lon[valid], sst[valid], bins=[latbins, lonbins], statistic='std').statistic
image = plt.pcolormesh(lonbins, latbins, sststd, cmap=cmap, shading='flat', transform=ccrs.PlateCarree())
plt.title('Sea Surface Temperature Standard Deviation')
plt.clim(0, 6)
fig.colorbar(image, ax=ax, orientation='vertical', shrink=0.5, aspect=30, pad=0.05, label='Temperature Standard Deviation (deg C)');
This shows a remarkably strong asymmetry between the northern and southern hemispheres.
Regions of the highest mean sea surface temperature—such as the tropical warm pools—and also regions of the lowest mean sea surface temperature—the polar regions— are both regions of notably low temperature variability.
The regions of highest temperature variability are the eddying or high velocity variance regions of the northern subtropical western boundary currents, specifically the Gulf Stream and Kuroshio, as well as relatively shallow seas such as the Sea of Okhotsk and the Black Sea.
For a while now we've been looking at the statistics in latitude–longitude space. However, there are many other perspectives from which we could view this data. Let's start by returning to temperature and trying another way to examine the large-scale gradient.
#mean sea surface temperature, with latitudinal profile
valid = ~np.isnan(sst)
sstbar = stats.binned_statistic_2d(lat[valid], lon[valid], sst[valid], bins=[latbins, lonbins], statistic='mean').statistic
xbins=np.arange(-3, 32,.1)
latmid=latbins[0:-1]+1/2*(latbins[1]-latbins[0])
latmat=np.repeat(np.atleast_2d(latmid).T,np.size(sstbar,axis=1),axis=1)
mat = stats.binned_statistic_2d(latmat[~np.isnan(sstbar)],sstbar[~np.isnan(sstbar)], None, \
bins=[latbins, xbins], statistic='count').statistic
fig, ax = plt.subplots(1, 2, width_ratios=[3, 1],figsize=np.array([18, 10]),sharey=True)
plt.sca(ax[0])
image = plt.pcolormesh(lonbins, latbins, sstbar, cmap=cmap, shading='flat')
plt.title('Mean Sea Surface Temperature')
plt.clim(0, 30)
#ax[0].set_aspect(1.25)
#It seems setting the aspect ratio of one plot breaks sharey, so I have to manually set the aspect ratio of the other
plt.sca(ax[1])
image = plt.pcolormesh(xbins, latbins, np.log10(mat), cmap=cmap, shading='flat')
plt.title('SST Latitudinal Profile')
plt.clim(0,1.85)
ax[1].set_aspect(0.364)
ax[1].yaxis.tick_right()
ax[1].set_xlabel('Mean SST at Each Gridpoint (deg C)')
plt.subplots_adjust(wspace=0.025)
fig.colorbar(image, ax=ax[0], orientation='horizontal', shrink=0.8, aspect=30, pad=0.08, label='Temperature (deg C)');
fig.colorbar(image, ax=ax[1], orientation='horizontal', shrink=0.8, aspect=30/3, pad=0.08, label='Log10 Number of Grid Points');
The right-hand panel now has latitude as its y-axis, but its x-axis is the mean temperature at each gridpoint. We can think of this as rotating the data distribution: rather than looking at it from the 'top', with latitude and longitude as the axes and the temperature flattened, we turn it on its side so that longitude disappears while temperature appears as an explicit dimension.
The right-hand panel now summarizes the distribution along the longitude axis from the left-hand panel. The different 'stripes' in the latitude–temperature distribution likely represent the different ocean basins, a hypothesis one could readily verify.
Let's continue with this approach and look at latitudinal profiles of a number of different quantities.
#computations for latitudinal profiles
valid = ~np.isnan(sst)
sstbar = stats.binned_statistic_2d(lat[valid], lon[valid], sst[valid], bins=[latbins, lonbins], statistic='mean').statistic
sststd = stats.binned_statistic_2d(lat[valid], lon[valid], sst[valid], bins=[latbins, lonbins], statistic='std').statistic
spdbar = stats.binned_statistic_2d(lat, lon, np.abs(u+1j*v), bins=[latbins, lonbins], statistic='mean').statistic
ubar = stats.binned_statistic_2d(lat, lon, u, bins=[latbins, lonbins], statistic='mean')
vbar = stats.binned_statistic_2d(lat, lon, v, bins=[latbins, lonbins], statistic='mean')
meanflow = np.sqrt(ubar.statistic ** 2 + vbar.statistic ** 2)
ustd = stats.binned_statistic_2d(lat, lon, u, bins=[latbins, lonbins], statistic='std').statistic
vstd = stats.binned_statistic_2d(lat, lon, v, bins=[latbins, lonbins], statistic='std').statistic
std = np.sqrt(ustd ** 2 + vstd ** 2)
std[std == 0] = np.nan #set zero values of standard deviation to nans
latmid=latbins[0:-1]+1/2*(latbins[1]-latbins[0])
latmat=np.repeat(np.atleast_2d(latmid).T,np.size(sstbar,axis=1),axis=1)
#latitudinal profiles
fig, ax = plt.subplots(1, 4,figsize=np.array([16, 9]))
plt.subplots_adjust(wspace=0.035)
plt.sca(ax[0])
xbins=np.arange(-3, 32,.1)
mat = stats.binned_statistic_2d(latmat[~np.isnan(sstbar)],sstbar[~np.isnan(sstbar)], None, \
bins=[latbins, xbins], statistic='count').statistic
image = plt.pcolormesh(xbins, latbins, np.log10(mat), cmap=cmap, shading='flat')
plt.title('Mean SST')
plt.ylabel('Latitude')
plt.xlabel('Temperature (deg C)')
plt.sca(ax[1])
xbins=np.arange(0,7,0.025)
mat = stats.binned_statistic_2d(latmat[~np.isnan(sststd)],sststd[~np.isnan(sststd)], None, \
bins=[latbins, xbins], statistic='count').statistic
image = plt.pcolormesh(xbins, latbins, np.log10(mat), cmap=cmap, shading='flat')
plt.title('Temperature Standard Deviation')
plt.xlabel('Temperature Standard Deviation (deg C)')
plt.sca(ax[2])
xbins=np.arange(0,60,1/3)
mat = stats.binned_statistic_2d(latmat[~np.isnan(meanflow)],meanflow[~np.isnan(meanflow)], None, \
bins=[latbins, xbins], statistic='count').statistic
image = plt.pcolormesh(xbins, latbins, np.log10(mat), cmap=cmap, shading='flat')
plt.title('Speed of Mean Flow')
plt.xlabel('Velocity Magnitude (cm/s)')
plt.sca(ax[3])
xbins=np.arange(0,60,1/3)
mat = stats.binned_statistic_2d(latmat[~np.isnan(std)],std[~np.isnan(std)], None, \
bins=[latbins, xbins], statistic='count').statistic
image = plt.pcolormesh(xbins, latbins, np.log10(mat), cmap=cmap, shading='flat')
plt.title('Velocity Standard Deviation')
plt.xlabel('Velocity Standard Deviation (cm/s)')
#all colorbars are the same, so we can loop over like so
for n in range(4):
    plt.sca(ax[n])
    plt.clim(0, 1.85)
    fig.colorbar(image, ax=ax[n], orientation='horizontal', shrink=0.8, aspect=20, pad=0.1, label='Log10 Number of Grid Points')
    if n > 0:
        plt.yticks([])
These show the latitude distributions corresponding to the latitude–longitude plots we made earlier.
One pattern that jumps out is that the temperature standard deviation is lowest near the equator and at high southern latitudes, that is, where the temperature is warmest or coldest.
We can also examine the joint structure of these variables by taking the mean value of a third quantity within each two-dimensional bin. Some combinations are more revealing than others; an interesting one is the joint structure of mean temperature and current speed.
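The mechanics of binning a third variable can be checked on a few made-up points: with statistic='mean', each bin reports the average of the third argument over the points that fall in it, and empty bins come back as NaN.

```python
import numpy as np
from scipy import stats

#toy data: two points in the lower-left bin, two in the upper-right
x = np.array([0.1, 0.2, 0.8, 0.9])
y = np.array([0.1, 0.2, 0.8, 0.9])
z = np.array([1.0, 3.0, 10.0, 20.0])
edges = np.array([0.0, 0.5, 1.0])

zbar = stats.binned_statistic_2d(x, y, z, bins=[edges, edges],
                                 statistic='mean').statistic

print(zbar[0, 0])            # 2.0, the mean of 1 and 3
print(zbar[1, 1])            # 15.0, the mean of 10 and 20
print(np.isnan(zbar[0, 1]))  # True: no points fell in that bin
```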
#joint structure of temperature and velocity
fig, ax = plt.subplots(1, 2,figsize=np.array([16, 10]))
plt.subplots_adjust(wspace=0.025)
plt.sca(ax[0])
xbins=np.arange(-3, 32,.1)
mat = stats.binned_statistic_2d(latmat[~np.isnan(sstbar)],sstbar[~np.isnan(sstbar)], None, \
bins=[latbins, xbins], statistic='count').statistic
image = plt.pcolormesh(xbins, latbins, np.log10(mat), cmap=cmap, shading='flat')
plt.title('Distribution of Mean SST')
plt.clim(0,1.85)
plt.ylabel('Latitude')
plt.xlabel('Temperature (deg C)')
fig.colorbar(image, ax=ax[0], orientation='horizontal', shrink=0.8, aspect=20, pad=0.1, label='Log10 Number of Grid Points');
plt.sca(ax[1])
mat = stats.binned_statistic_2d(latmat[~np.isnan(sstbar)],sstbar[~np.isnan(sstbar)], meanflow[~np.isnan(sstbar)], \
bins=[latbins, xbins], statistic='mean').statistic
image = plt.pcolormesh(xbins, latbins, mat, cmap=cmap, shading='flat')
plt.title('Mean Speed in Temperature / Latitude Bins')
plt.clim(0,40)
plt.xlabel('Temperature (deg C)')
plt.yticks([])
fig.colorbar(image, ax=ax[1], orientation='horizontal', shrink=0.8, aspect=20, pad=0.1, label='Velocity Magnitude (cm/s)');
The left panel again shows the distribution of grid points in bins on the latitude / mean SST plane, while the right panel shows the mean current speed in bins on that same plane.
Note that at midlatitudes, the warmest grid points—those on the far right-hand edge of the distribution—occur in the fastest currents. These are the strong, but spatially compact, western boundary currents advecting warmer water poleward.
Finally, we can use this same approach to look at the annual cycle. We will now examine the mean current speed, and mean temperature, in bins on the latitude / day of year plane, as well as the mean values of their deviations from the zonal average.
#annual cycle of current speed
dy=tim.astype('datetime64[D]').astype(int) - tim.astype('datetime64[Y]').astype('datetime64[D]').astype(int)+1
#dy is yearday, formed as (day) - (day for the Jan 1 of the current year)
fig, ax = plt.subplots(1, 2,figsize=np.array([18, 10]))
plt.subplots_adjust(wspace=0.035)
xbins=np.arange(0, 366, 5)
plt.sca(ax[0])
mat = stats.binned_statistic_2d(lat, dy, np.abs(u+1j*v), bins=[latbins, xbins], statistic='mean').statistic
image = plt.pcolormesh(xbins, latbins, mat, cmap=cmap, shading='flat')
plt.title('Current Speed Annual Cycle')
ax[0].set_xlabel('Yearday')
ax[0].set_ylabel('Latitude')
plt.clim(0,60)
fig.colorbar(image, ax=ax[0], orientation='horizontal', shrink=0.8, aspect=20, pad=0.1, label='Velocity (cm/s)');
plt.sca(ax[1])
mat = mat - np.repeat(np.atleast_2d(np.nanmean(mat, axis=1)).T,np.size(mat,axis=1),axis=1)
image = plt.pcolormesh(xbins, latbins, mat, cmap=cmap, shading='flat')
plt.title('Zonal Deviation of Current Speed Annual Cycle')
ax[1].set_xlabel('Yearday')
plt.clim(-12,12)
plt.yticks([])
fig.colorbar(image, ax=ax[1], orientation='horizontal', shrink=0.8, aspect=20, pad=0.1, label='Velocity (cm/s)');
#annual cycle of temperature
dy=tim.astype('datetime64[D]').astype(int) - tim.astype('datetime64[Y]').astype('datetime64[D]').astype(int)+1
#dy is yearday, formed as (day) - (day for the Jan 1 of the current year)
valid = ~np.isnan(sst)
fig, ax = plt.subplots(1, 2,figsize=np.array([18, 10]))
plt.subplots_adjust(wspace=0.035)
xbins=np.arange(0, 366, 5)
plt.sca(ax[0])
mat = stats.binned_statistic_2d(lat[valid], dy[valid], sst[valid], bins=[latbins, xbins], statistic='mean').statistic
image = plt.pcolormesh(xbins, latbins, mat, cmap=cmap, shading='flat')
plt.title('Sea Surface Temperature Annual Cycle')
ax[0].set_xlabel('Yearday')
ax[0].set_ylabel('Latitude')
plt.clim(0,31)
fig.colorbar(image, ax=ax[0], orientation='horizontal', shrink=0.8, aspect=20, pad=0.1, label='Temperature (deg C)');
plt.sca(ax[1])
mat = mat - np.repeat(np.atleast_2d(np.nanmean(mat, axis=1)).T,np.size(mat,axis=1),axis=1)
image = plt.pcolormesh(xbins, latbins, mat, cmap=cmap, shading='flat')
plt.title('SST Zonal Deviation Annual Cycle')
ax[1].set_xlabel('Yearday')
plt.clim(-6,6)
plt.yticks([])
fig.colorbar(image, ax=ax[1], orientation='horizontal', shrink=0.8, aspect=20, pad=0.1, label='Temperature (deg C)');
From these two figures, we see that the current speed presents only a minor annual cycle, largely limited to the equatorial band, whereas the temperature unsurprisingly shows a very large annual cycle.
This exploratory analysis has given us an overview of the main features apparent in the data. At this point, we would begin identifying features worthy of further study and / or forming hypotheses to investigate in more detail.
print((time.time() - tic)/60) #elapsed time in minutes
3.5829830010732016